Internet image

0.1 Problem

One common goal of the microbiome study is to infer dependence relationship between taxa. Relative abundance data of microbiota is compositional data. The classic statistical methods (e.g., Pearson correlation and Spearman rank correlation), however, are invalid for compositional (microbiome) data and leads to spurious correlations.

0.2 Solution

The review article, “Microbiome datasets are compositional: and this is not optional” (Gloor et al. 2017)summarized three valid statistical methods that deal with compositionality in microbiome relative abundance data to achieve reliable inferred correlation, including proportionality and SparCC, Spiec Easi.

In this mini-project, I focused on illustrating proportionality analysis on the real microbiome data. Lovell and co-authors (Lovell et al. 2015) proposed the proportionality method for relative correlation, as a valid alternative to absolute correlation. In 2017, Quinn and his colleagues developed the R package, propr to implement the proportionality analysis (Thomas P. Quinn et al. 2017).

Based on experiments on simulation datasets, proportionality is a precise indicator of absolute correlation, although sensitivity is limited (Thomas P. Quinn et al. 2017).

0.3 Purpose of mini-project

This mini-project is aimed to:

  • illustrate the proportionality analysis on a real dataset, using propr R package

  • validate the proportionality method on the real dataset, by checking up the sensitivity and specificity

0.4 Outline of mini-project

  1. Calculate proportionality
  2. Identify highly-proportional taxa-pairs
  3. Visualize proportionality
  4. Evaluate sensitivity and specificity
  5. Analyze differential proportionality

0.5 R scripts

library(tidyverse) 

library(devtools)
# install_github("tpq/propr")  
library(propr)
library(Hmisc)  # Spearman correlation 
library(corrplot)  # visualize correlation matrix 
library(stringr)

0.5.1 Load datasets

The microbiome OTU table and metadata was retrieved from ML Repo

# raw OTU table
raw_otu <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/refseq/otutable.txt",
                    header=T,
                    sep = "")

The raw OTU table contains 557 features (OTUs) in 281 samples.

Shorten OTU identifiers with keeping only genus and species names, using stringr package. And then accumulate counts of OTUs at the species level.

head(raw_otu$X.OTU) 
## [1] "NR_147401.1_Sutterella_massiliensis_strain_Marseille-P2435_16S_ribosomal_RNA__partial_sequence"    
## [2] "NR_025930.1_Ruminococcus_bromii_strain_ATCC_27255_16S_ribosomal_RNA_gene__partial_sequence"        
## [3] "NR_025230.1_Megasphaera_micronuciformis_strain_AIP_412.00_16S_ribosomal_RNA_gene__partial_sequence"
## [4] "NR_024994.1_Lactobacillus_mucosae_strain_S32_16S_ribosomal_RNA_gene__complete_sequence"            
## [5] "NR_024750.1_Coprococcus_catus_strain_VPI-C6-61_16S_ribosomal_RNA_gene__partial_sequence"           
## [6] "NR_024684.1_Eubacterium_nitritogenes_strain_JCM_6485_16S_ribosomal_RNA_gene__partial_sequence"
# split strings 
otu.id <- raw_otu$X.OTU

split_otu.id <- str_split_fixed(otu.id, "_", n=6) %>% 
        as.data.frame()

# join genus and species fragments 
join_split_otu.id <- split_otu.id %>% 
        mutate(taxa = str_c(V3, V4, sep = " ")) 

taxa <- join_split_otu.id$taxa 

# merge taxa to raw otu table 
raw_otu_taxa <- cbind(raw_otu, taxa) 


# accumulate counts at the species level 
otu_count_splevel <- raw_otu_taxa %>% 
        gather(key = "sample", value = "count", -c("ID", "X.OTU", "taxa")) %>% 
        group_by(sample, taxa) %>%
        summarise(count_sp = sum(count, na.rm = T)) %>% 
        arrange(desc(count_sp)) %>%
        ungroup()

Read in the metadata from URL.

# metadata 

meta <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/task-obese-lean-all.txt",
         header = TRUE,
         sep = "") %>% 
        mutate(sample = X.SampleID, is.obese = Var) %>% 
        select(sample, is.obese)

head(meta)

The metadata contains 142 samples. Var indicates the independent, binary variable of interest, Lean and Obese.

0.5.2 2. Preprocess data frames for following proportionality analysis

Transpose the OTU table into sample x taxa format which is suitable for proportionality analysis.

otu_count_splevel2 <- otu_count_splevel %>% 
        spread(key = "taxa", value = count_sp) %>%
        column_to_rownames("sample")

Now, the resulting OTU count table at the species level contains 517 features/ species, in 281 samples.

Identify low-abundant OTUs and index them in the OTU table.

keep_index <- apply(otu_count_splevel2, 2, function(x) sum(x >= 10) >= 5)  # index of low abundant OTUs 

154 unique species/taxa remains after dropping the low-abundant taxa.

Add group variables to the resulting OTU table.

otu_count_splevel2_meta <- otu_count_splevel2 %>%
        rownames_to_column("sample") %>%
        inner_join(meta, by= "sample") %>%
        column_to_rownames("sample") 

head(otu_count_splevel2_meta) # print first rows 
# knitr::kable(
#   head(otu_count_splevel2_meta[ ,1:8], 10),
#   booktabs=TRUE,
#   caption = "A table of the first 6 rows of the OTU matrix at the species level"
# )

Check if any missing values in the OTU table. Replace missing values with zero if any.

sum(is.na(otu_count_splevel2)) 
## [1] 0
# otu_count_splevel2_rmna <- otu_count_splevel2 %>%
#         replace(is.na(.), 0)
# 
# sum(is.na(otu_count_splevel2_rmna))

0.5.3 Calculate proportionality

Calculate a proportionality measurement, rho using propr function, following the procedures and codes provided by T.P.Quinn et al. (Thomas P. Quinn et al. 2019)

# calculate the propr object 
pr <- propr(counts = otu_count_splevel2, # a count matrix with subjects as rows and features as columns
            metric = "rho", # the proportionality metric to calculate, 'rho', 'pho' or 'phs'
            ivar = "clr", # reference OTU/ feature for alr. 
            symmetrize = FALSE,
            p = 100) # permutation cycles (100, by default)  
            # select = keep_index) # filter low-abundant taxa

# pr@matrix # proportionality metrics 

# subset the propr object, dropping low-abundant taxa 
pr_sub <- subset(pr, select = keep_index)

# pr_sub@matrix # proportionality metrics of high-enough-abundant taxa 


# the alternative way to calculate rho statistics
# rho <- perb(otu_count_splevel2,
#             ivar =  "clr",
#             select= keep_index)  # subset abundant taxa after calculating proportionality on the complete matrix

Set three required arguments in the wrapper propr function as follows,

  • counts a count matrix with subjects as rows and features as columns

  • metric the proportionality metric to calculate, ‘rho’, ‘pho’ or ‘phs’

  • ivar reference OTU/ feature

For proportionality, permute false discovery rate (FDR) for a given cut-off, instead of estimating parametric P-value.

# select a good cutoff for 'rho' by permuting the FDR at various cutoffs.
# below, use [0, .05, ..., .95, 1]

pr <- updateCutoffs(pr, cutoff = seq(0, 1,.05)) 
## |------------(25%)----------(50%)----------(75%)----------|
# pr@fdr  # identify the maximal cutoff at FDR < 5% 

The table shows that the largest cutoff is 0.20 when the FDR reaches 5% or fewer.

Identify highly-proportional (rho >= 0.7) feature/ species pairs.

pr_sub_high <- pr_sub[">", .7] 
pr_sub_high 
## Not weighted and not alpha-transformed 
## @counts summary: 281 subjects by 153 features
## @logratio summary: 281 subjects by 153 features
## @matrix summary: 153 features by 153 features
## @pairs summary: 2 feature pairs
## @fdr summary: iterations
## See ?propr for object methods
pr_sub_high@pairs # index of highly proportional pairs 
## [1]  9246 17256
pr_sub_high_mt <- pr_sub_high@matrix %>% 
  as.data.frame()

high_propr_pairs <- pr_sub_high_mt %>% 
  rownames_to_column("partner") %>%
  mutate(partner = factor(partner)) %>% 
  gather(key="pair", value = "rho", -partner) %>% 
  mutate(pair = factor(pair)) %>% 
  mutate(partner_index = as.numeric(partner)) %>% 
  mutate(pair_index = as.numeric(pair)) %>% 
  filter(partner_index < pair_index) %>% 
  filter(rho > 0.7) %>% 
  select(partner, pair, rho) %>% 
  mutate(rho = round(rho, 2))


knitr::kable(high_propr_pairs, 
             booktab = TRUE,
             caption = "Highly proportional species pairs (rho > 0.7)") 
Highly proportional species pairs (rho > 0.7)
partner pair rho
Blautia luti Blautia wexlerae 0.79
Megasphaera indica Olsenella uli 0.89

0.5.4 Visualize proportionality

Visualize proportionality with a strict cutoff, using getNetwork function. The package vignette describes several built-in tools for visualizing proportionality.

getNetwork(pr, 
           cutoff = 0.7, 
           d3=F) # not display in 3D in this case 
Proportionality network based on rho matrics

Proportionality network based on rho matrics

## IGRAPH 120d145 UN-- 273 22346 -- 
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from 120d145 (vertex names):
##  [1] Natranaerovirga hydrolytica--Rothia mucilaginosa        
##  [2] Natranaerovirga hydrolytica--Syntrophococcus sucromutans
##  [3] Natranaerovirga hydrolytica--Streptococcus lactarius    
##  [4] Natranaerovirga hydrolytica--Parvimonas micra           
##  [5] Natranaerovirga hydrolytica--Lactobacillus paraplantarum
##  [6] Natranaerovirga hydrolytica--Sphingomonas koreensis     
##  [7] Natranaerovirga hydrolytica--Gemella sanguinis          
##  [8] Natranaerovirga hydrolytica--Ralstonia pickettii        
## + ... omitted several edges
# plot(pr_sub_high)

0.5.5 Interpret proportionality

How to interpret proportionality? Proportionality depends on log-transformation and must get interpreted with respect to the chosen reference.

Interpret clr-based proportionality to signify a coordination that follows the general trend of the data. In other words, these proportional OTUs move together as individuals relative to how most genes move on average (Thomas P. Quinn et al. 2019).

0.5.6 Evaluate sensitivity and specificity of the proportionality method

Apply Spearman rank correlation to raw absolute abundance matrix. Consider the resulting Spearman correlation coefficients (absolute correlation) as the true correlations between taxa pairs.

Use Hmisc package to implement Spearman rank correlation analysis and corrplot package to visualize correlation matrix.

spearman_corr <- Hmisc::rcorr(as.matrix(otu_count_splevel2), 
                              type = "spearman")   

corrplot::corrplot(spearman_corr$r, 
                   method = "color", 
                   tl.pos = "n",  # mutate text labels 
                   type="upper") 
Correlation matrix of 517 species based on Spearman rank correlation method

Correlation matrix of 517 species based on Spearman rank correlation method

# extract Spearman coefficients 
abs_corr_r <- spearman_corr$r %>% 
  as.data.frame() %>% 
  rownames_to_column("partner") %>% 
  mutate(partner = factor(partner)) %>% 
  gather(key="pair", value = "sp_coef", -partner) %>% 
  mutate(pair = factor(pair)) %>% 
  mutate(partner_index = as.numeric(partner)) %>% 
  mutate(pair_index = as.numeric(pair)) %>% 
  filter(partner_index < pair_index) %>%
  arrange(partner_index, pair_index)

# extract Spearman p-val 
abs_corr_pval <- spearman_corr$P %>% 
  as.data.frame() %>% 
  rownames_to_column("partner") %>% 
  mutate(partner = factor(partner)) %>% 
  gather(key="pair", value = "sp_pval", -partner) %>% 
  mutate(pair = factor(pair)) %>% 
  mutate(partner_index = as.numeric(partner)) %>% 
  mutate(pair_index = as.numeric(pair)) %>% 
  filter(partner_index < pair_index) %>% 
  arrange(partner_index, pair_index) 
  
# combine Spearman coefficients and p-val 
abs_corr_r_pval <- abs_corr_r 

abs_corr_r_pval$sp_pval <- abs_corr_pval$sp_pval 

# proportionality matrix (rho)
propr_rho <- pr@matrix %>%
  as.data.frame() %>% 
  rownames_to_column("partner") %>%
  mutate(partner = factor(partner)) %>% 
  gather(key="pair", value = "rho", -partner) %>% 
  mutate(pair = factor(pair)) %>% 
  mutate(partner_index = as.numeric(partner)) %>% 
  mutate(pair_index = as.numeric(pair)) %>% 
  filter(partner_index < pair_index) %>% 
  arrange(partner_index, pair_index)

# combine absolute correlation and relative correlation coefficients 

abs_rel_coef <- abs_corr_r_pval 

abs_rel_coef$rho_propr <- propr_rho$rho 

ggplot(abs_rel_coef, aes(x= sp_coef, y= rho_propr))+
  geom_jitter(alpha = 0.5) +
  stat_smooth(method = "lm")

# contingency table 
cont_matrix <- table("Actual" = abs_rel_coef$sp_coef > 0.9,
                     "Observed" = abs_rel_coef$rho_propr > 0.9)  

cont_matrix # print 
##        Observed
## Actual   FALSE   TRUE
##   FALSE 121453  11869
##   TRUE      24     40
prop.table(cont_matrix, 1)[1,1] # specificity
## [1] 0.9109749
prop.table(cont_matrix, 1)[2,2] # sensitivity 
## [1] 0.625

At the rho (proportionality matrices) cutoff of 0.9, the specificity of the proportionality is 91.1% and the sensitivity 62.5%.

0.5.7 Differential proportionality analysis

propd functions in the propr package can test whether proportionality of certain pairs change between binary experimental groups.

The package vignette explains mathematically the differential proportionality measures.

According to the package vignette, propd method compares the variance of log-ratio (VLR) for one pair across groups, considering two types of differential proportionality,

  • disjointed proportionality: proportionality of a pair holds in both groups, but the ratio between the partners changes between the groups (i.e., the slope of the proportionality changes)

  • emergent proportionality: proportionality exits in only one of the groups (i.e., the strength of the proportionality changes)

propd function estimates differential proportionality by calculating \(theta\) for all features pairs.

The function takes the following arguments as input:

  • counts a matrix of \(n\) samples (as rows) and \(d\) features (as columns)

  • group an \(n\)-dimensional vector corresponding to subject labels

  • alpha an optional argument to trigger and guide transformation

  • p the total number of permutations used to estimate FDR

counts <- otu_count_splevel2_meta %>% select(-is.obese) # count matrix, sample X taxa 

group <- otu_count_splevel2_meta$is.obese # binary classes, obese vs lean 

pd <- propd(counts, group, alpha = NA, p=100) # calculate disjointed- and emergent propr. 

According the package vignette, the propd object contains theta_d and theta_e (among others), although only theta_d is active by default. Users can easily change which theta is active the functions setDisjointed and setEmergent.

 
theta_d <- setDisjointed(pd)  # activate only disjointed diff prop 

theta_e <- setEmergent(pd)  # activate only emergent diff prop 
## Warning in setEmergent(pd): Emergent proportionality not yet validated for
## unequal group sizes.

Extract the detailed results from the above objects, including OTU pairs and differential proportionality statistics.

How to identify most important pairs based on the theta_d and theta_e? Qualitatively, the smaller the theta_d, the more important the proportional pairs are. In contrast, the bigger the theta_e, the more important the proportional pairs are.

# estimate FDR using updateCutoffs function
theta_d <- updateCutoffs(theta_d, cutoff = seq(0.91, 1, 0.001)) 
## |------------(25%)----------(50%)----------(75%)----------|
theta_d@fdr  
knitr::kable(head(theta_d@fdr, 10),
             caption = "FDRs at various rho cutoffs",
             digits = 2) 
FDRs at various rho cutoffs
cutoff randcounts truecounts FDR
0.91 23.50 0 Inf
0.91 24.12 0 Inf
0.91 26.41 1 26.41
0.91 27.29 1 27.29
0.91 28.19 1 28.19
0.92 29.58 1 29.58
0.92 32.68 1 32.68
0.92 34.90 2 17.45
0.92 41.02 2 20.51
0.92 45.57 2 22.78
# disjointed diff. proportionality 
theta_d_df <- getResults(theta_d) 

theta_d_df_sub <- theta_d_df %>% 
  select(Partner, Pair, theta, lrv, lrv1, lrv2, lrm, lrm1, lrm2) %>% 
  mutate(lrv = round(lrv,2),
         lrv1 = round(lrv1, 2),
         lrv2 = round(lrv2, 2),
         lrm = round(lrm, 2),
         lrm1 = round(lrm1, 2),
         lrm2 = round(lrm2, 2)) 

theta_d_df_sub 

In the table, theta indicates theta_d in this case. lrv1 represents log-ratio variance in the Obese group; lrv2 in the Lean group. lrm1 represents log-ratio mean in the Obese group, and lrm2 in the Lean group.

The column theta equals theta_d matrices. Values of theta are 0.91 or greater which are relatively large, indicating that almost all proportional pairs do not obviously change with experimental conditions (i.e., obese or not). In addition, FDR of all the proportional pairs are greater than 5%.

Similarly, estimate FDR for the emergent differential proportionality.

theta_e <- updateCutoffs(theta_e)
# emergent diff. proportionality 
theta_e_df <- getResults(theta_e)  

0.5.8 Disproportionality networks

Visualize the network based on disjointed differential proportionality using plot function.

Providing a value of [0,1] to the cutoff argument will index pairs based on a maximum value of theta. Alternately, providing an integer greater than 1 to the cutoff argument will index the top N pairs as ranked by theta.

Note that setting d3 = TRUE will have the rgl package render the network in 3D.

The igraph package implement visualization of the theta_d object.

net_d <- plot(theta_d, # disjointed proportionality object 
              cutoff= 300,  
              d3= F)
Disproportionality network based on disjointed proportionality

Disproportionality network based on disjointed proportionality

Each “node” represents a species/taxa while each connecting “edge” represents an indexed pair (i.e. theta < cutoff).

For disjointed proportionality networks, red edges represent proportional pairs has higher log-ratio mean in Obese compared to Lean; green edges represent higher log-ratio mean in Lean compared to Obese.

Importantly, a small number of taxa/ species participants in a large number of the top differentially proportional pairs.

Visualize the network based on emergent differential proportionality using plot function.

net_e <- plot(theta_e, # emergent proportionality object 
              cutoff=800, 
              d3=F)  # whether or not display 3-D plot   
Disproportionality network based on emergent proportionality

Disproportionality network based on emergent proportionality

For the emergent network, red edges represent an emergency of proportionality in Obese compared to Lean (i.e., sudden coordination in Obese) while green edges represent a lack of proportionality in Obese compared to Lean (i.e., no coordination in Obese).

0.6 Reference

Gloor, Gregory B., Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8 (November). https://doi.org/10.3389/fmicb.2017.02224.
Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. 2015. “Proportionality: A Valid Alternative to Correlation for Relative Data.” Edited by Roland L. Dunbrack Jr. PLOS Computational Biology 11 (3): e1004075. https://doi.org/10.1371/journal.pcbi.1004075.
Quinn, Thomas P, Ionas Erb, Greg Gloor, Cedric Notredame, Mark F Richardson, and Tamsyn M Crowley. 2019. “A Field Guide for the Compositional Analysis of Any-Omics Data.” GigaScience 8 (9). https://doi.org/10.1093/gigascience/giz107.
Quinn, Thomas P., Mark F. Richardson, David Lovell, and Tamsyn M. Crowley. 2017. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” Scientific Reports 7 (1). https://doi.org/10.1038/s41598-017-16520-0.

0.7 Package version

The analysis was performed with R version 4.2.1. Version information of all the R packages are as follow,

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] corrplot_0.92   Hmisc_4.7-2     Formula_1.2-4   survival_3.4-0 
##  [5] lattice_0.20-45 propr_4.3.0     devtools_2.4.5  usethis_2.1.6  
##  [9] forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.5    
## [13] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.3.6  
## [17] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##   [1] googledrive_2.0.0   colorspace_2.0-3    deldir_1.0-6       
##   [4] ellipsis_0.3.2      htmlTable_2.4.1     base64enc_0.1-3    
##   [7] fs_1.5.2            rstudioapi_0.14     farver_2.1.1       
##  [10] remotes_2.4.2       fansi_1.0.3         lubridate_1.8.0    
##  [13] xml2_1.3.3          splines_4.2.1       cachem_1.0.6       
##  [16] knitr_1.40          pkgload_1.3.1       jsonlite_1.8.2     
##  [19] broom_1.0.1         cluster_2.1.4       dbplyr_2.2.1       
##  [22] png_0.1-7           shiny_1.7.3         compiler_4.2.1     
##  [25] httr_1.4.4          backports_1.4.1     assertthat_0.2.1   
##  [28] Matrix_1.5-3        fastmap_1.1.0       gargle_1.2.1       
##  [31] cli_3.4.1           later_1.3.0         htmltools_0.5.3    
##  [34] prettyunits_1.1.1   tools_4.2.1         igraph_1.3.5       
##  [37] gtable_0.3.1        glue_1.6.2          Rcpp_1.0.9         
##  [40] cellranger_1.1.0    jquerylib_0.1.4     vctrs_0.4.2        
##  [43] nlme_3.1-160        xfun_0.33           ps_1.7.2           
##  [46] rvest_1.0.3         mime_0.12           miniUI_0.1.1.1     
##  [49] lifecycle_1.0.3     googlesheets4_1.0.1 scales_1.2.1       
##  [52] hms_1.1.2           promises_1.2.0.1    RColorBrewer_1.1-3 
##  [55] yaml_2.3.5          memoise_2.0.1       gridExtra_2.3      
##  [58] sass_0.4.2          rpart_4.1.19        latticeExtra_0.6-30
##  [61] stringi_1.7.8       highr_0.9           checkmate_2.1.0    
##  [64] pkgbuild_1.3.1      rlang_1.0.6         pkgconfig_2.0.3    
##  [67] evaluate_0.18       labeling_0.4.2      htmlwidgets_1.5.4  
##  [70] processx_3.8.0      tidyselect_1.2.0    magrittr_2.0.3     
##  [73] R6_2.5.1            generics_0.1.3      profvis_0.3.7      
##  [76] DBI_1.1.3           mgcv_1.8-41         pillar_1.8.1       
##  [79] haven_2.5.1         foreign_0.8-83      withr_2.5.0        
##  [82] nnet_7.3-18         modelr_0.1.10       crayon_1.5.2       
##  [85] interp_1.1-3        utf8_1.2.2          tzdb_0.3.0         
##  [88] rmarkdown_2.18      urlchecker_1.0.1    jpeg_0.1-9         
##  [91] grid_4.2.1          readxl_1.4.1        data.table_1.14.4  
##  [94] callr_3.7.3         reprex_2.0.2        digest_0.6.29      
##  [97] xtable_1.8-4        httpuv_1.6.6        munsell_0.5.0      
## [100] bslib_0.4.1         sessioninfo_1.2.2